Abstract

Information theory provides principled models to analyze different inference
and learning problems such as hypothesis testing, clustering, dimensionality
reduction, and classification, among others. However, the use of information
theoretic quantities as test statistics, that is, as quantities obtained from
empirical data, poses a challenging estimation problem that often leads to
strong simplifications such as Gaussian models, or to the use of plug-in
density estimators that are restricted to certain representations of the data.
In this paper, a framework to non-parametrically obtain measures of entropy
directly from data using infinitely divisible kernels is presented. In
resemblance to quantum information theory, functionals on positive definite
matrices that satisfy properties similar to the ones given in Renyi's
axiomatic definition of entropy are defined. The estimation of the
probability law underlying the data is therefore avoided, capitalizing on the
representation power that positive definite kernels bring. In the proposed
framework, analogues to quantities such as conditional entropy and mutual
information are obtained. Numerical validation using the proposed quantities
to test independence is provided. In the considered examples, the proposed
framework can achieve state-of-the-art performance.



1 Introduction 

Operational quantities in information theory are defined based on the probabil- 
ity laws underlying the data generation process of the system under analysis. 
Nevertheless, when learning from data, the only information available comes
from a sample $\{z_i\}_{i=1}^n$, so the use of information theoretic quantities
relies on an appropriate estimation process. A common approach splits the
estimation process into two steps, leading to so-called "plug-in" estimation.
First, the data is employed to fit a model of how it is distributed; then, the
estimated distribution is plugged into the definition of the information
theoretic quantity. Plug-in estimation is intuitive and straightforward, but
its behavior is conditioned on the quality of the estimate of the distribution,
which is by itself a difficult problem. If a parametric model is employed, the
main difficulty is choosing the right model, which often leads to
oversimplifying assumptions. On the other hand, non-parametric estimators of
the data distribution may require additional parameters that need to be tuned,
resorting to computationally expensive procedures that are prone to
overfitting; therefore, smoothing or capacity control mechanisms should be
considered as well.
Despite the above-mentioned difficulties, Renyi's definition of entropy along
with Parzen density estimation has been successfully applied to learning
problems by using information theoretic quantities such as entropy and relative
entropy as objective functions [1]. At the core of this methodology lies the
entropy estimator introduced below. Renyi's entropy of order $\alpha$ is a
generalization of Shannon's entropy obtained after relaxing the additivity
property of entropy to the generalized $\psi$-mean,

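\[ \langle x \rangle_\psi = \psi^{-1}\Big(\sum_i w_i \psi(x_i)\Big), \tag{1} \]

where $\sum_i w_i = 1$, $w_i > 0$, and $\psi$ is a Kolmogorov-Nagumo function.
For a parametrized family of functions $\psi_\alpha(x) = 2^{(\alpha-1)x}$,
Renyi's entropy of a random variable $X$ as a function of the parameter
$\alpha > 0$ is given by,

\[ H_\alpha(X) = \frac{1}{1-\alpha} \log_2 \sum_{x \in \mathcal{X}} p^\alpha(x), \tag{2} \]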

where $p$ is the probability mass function, or the probability density function
(if $X$ is a continuous random variable and the sum becomes an integral) of the
random variable $X$, and $\mathcal{X}$ is its support. Shannon's entropy
corresponds to the limiting case $\alpha \to 1$. We can immediately notice that
the quantity of interest is the argument of the log function in (2), which in
the case of $\alpha > 1$ corresponds to an inner product between $p(x)$ and its
transformed version $g(p(x))$, where $g(\cdot)$ is the non-negative
monotonically increasing function $g(y) = y^{\alpha-1}$. It can also be thought
of as the expected value $E[g(p(X))]$.

For $\alpha = 2$, a rather simple yet elegant plug-in estimator of (2) can be
derived using the Parzen window approximation. For a sample
$\{x_i\}_{i=1}^n \subset \mathbb{R}^d$ of the random variable $X$, assumed to
be i.i.d. and drawn from $f(x)$, the Parzen density estimator is
$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^n \kappa_\sigma(x, x_i)$, using a Gaussian
kernel $\kappa_\sigma(x,y) = C \exp\big(-\frac{1}{2\sigma^2}\|x-y\|^2\big)$,
where $C$ is a normalization constant and $\sigma$ the width; plugging this
estimate of the density into the integral $\int_{\mathcal{X}} f^2(x)\,dx$
yields:

\[ \hat{H}_2(X) = -\log \frac{1}{n^2} \sum_{i,j=1}^n \kappa_{\sqrt{2}\sigma}(x_i, x_j). \tag{3} \]
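The estimator in (3) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names `gaussian_gram` and `renyi2_parzen` are hypothetical, the normalization constant is assumed to be the usual Gaussian one, $C = (2\pi\sigma^2)^{-d/2}$, and the natural logarithm is used.

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix of the normalized Gaussian kernel for the rows of x (n x d)."""
    x = np.asarray(x, dtype=float)
    _, d = x.shape
    sq = np.sum(x ** 2, axis=1)
    # Squared pairwise distances; clip tiny negatives caused by round-off.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    c = (2.0 * np.pi * sigma ** 2) ** (-d / 2.0)  # assumed normalization constant C
    return c * np.exp(-d2 / (2.0 * sigma ** 2))

def renyi2_parzen(x, sigma):
    """Plug-in estimate of Renyi's second-order entropy, following (3)."""
    k = gaussian_gram(x, np.sqrt(2.0) * sigma)  # kernel width sqrt(2) * sigma
    return -np.log(np.mean(k))  # mean over all n^2 pairs equals (1/n^2) * sum

# A tightly packed sample should give a lower entropy estimate than a spread one.
tight = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
wide = np.linspace(-5.0, 5.0, 50).reshape(-1, 1)
h_tight = renyi2_parzen(tight, sigma=0.5)
h_wide = renyi2_parzen(wide, sigma=0.5)
```

The estimate depends on the kernel width `sigma`, which here is simply fixed by hand.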
A similar expression to (3), but using the kernel size $\sigma$, arises from
computing the empirical expectation $E_{\mathrm{emp}}[\hat{f}(X)]$. The
argument of the log in the first case, with kernel width $\sqrt{2}\sigma$, has
been called the information potential in analogy to the potential fields
arising in physics [1]. The information potential can be shown to be a special
case of a positive definite kernel called the cross-information potential, which
maps probability density functions in $L_2$ to a reproducing kernel Hilbert
space of functions [2]. This idea has already been exploited to solve
optimization problems with information theoretic objective functions that bear
close resemblance to kernel methods [3]. Formulating the problem in terms of
the kernel brings an interesting interpretation of the entropy estimator that
goes beyond the Parzen density estimation that initially motivated it. In other
words, a measure of entropy can be obtained by directly applying a kernel to
the data without having to consider the intermediate step of density
estimation. Other approaches based on entropic graphs that do not build on
distribution estimators have also been developed recently. These "direct"
approaches use k-nearest neighbor graphs, minimum spanning trees, or the
traveling salesman problem [4] [6] to consistently estimate Renyi's
$\alpha$-entropy directly from data samples in $\mathbb{R}^d$. Nonetheless,
these graph theoretic methods are restricted to entropy estimates for
$\alpha \in (0, 1)$, and can be difficult to use in adaptation due to their
non-differentiable nature.

In the present work, we define functionals on certain positive definite
matrices that fulfill properties of a measure of entropy without assuming that
probabilities of events are known or have been estimated. Even though these
functionals closely resemble well-known definitions in quantum information
theory, our approach differs in analysis and scope. In the quantum mechanical
setting, the density matrix (operator) describes a mixture of states the system
may assume. In our context, the object of study is the Gram matrix of pairwise
evaluations of a positive definite kernel, as we follow the statistical
learning setting where the only available information is contained in a finite
i.i.d. sample $Z = \{z_i\}_{i=1}^n$. Our data-driven entropy acts as a measure
of the lack (uncertainty) or presence (structure) of statistical regularities
in a given sample represented by a Gram matrix. In the analysis of the
information theoretic properties of the proposed functional, we show that the
choice of kernel is also key in defining measures of information directly from
data. In particular, we show that infinitely divisible kernels (subject to
normalization) are well suited for our purposes (obtaining a measure of entropy
directly from data).

We will start informally, motivating the idea of using a Gram matrix to compute
a measure of entropy by highlighting the relation between plug-in estimation of
Renyi's second-order entropy based on Parzen windows and the computation of the
expectation of an observable in quantum mechanics, which employs the concept of
density operator. Then, we define the entropy functional on positive definite
matrices and show how this functional satisfies a set of axioms similar to
those proposed by Renyi for a measure of entropy, as well as some basic
inequalities involving Hadamard products, for which infinitely divisible
matrices are particularly well suited. Next, we review some basic concepts of
infinite divisibility, followed by an analysis of the statistical properties of
the Gram matrices in connection to information theoretic learning, to provide a
complete picture of the proposed approach. Finally, we carry out numerical
experiments on testing independence that compare favorably with the state of
the art.

2 Motivation 

The use of Hilbert spaces to represent data is not a new idea [7, 8], and it
has become common practice in machine learning under the name of kernel
methods. One of the appeals of this approach is the ability to deal with
algorithms in a rather generic way, provided the kernel is well suited to the
particular problem. This property has been recently exploited in many practical
applications where data is not necessarily given as vectors in $\mathbb{R}^d$,
for example text, trees, point processes, and functional data, among others
[9]. It has been noticed that kernel-induced mappings can be understood as
means for computing high-order statistics of the data and manipulating them in
a linear fashion as first-order statistics. Methods such as kernel independent
component analysis [10], the work on measures of dependence and independence
using Hilbert-Schmidt norms [11], and recent work on quadratic measures of
independence [12] are just some examples of this emerging line of work.

2.1 Hilbert Space Representation of Data 

To motivate the use of positive definite matrices as suitable descriptors of
data, we need to understand the role of the Hilbert space representation and
how it naturally arises from the fundamental ideas of pattern analysis. Let
$(X, \mathcal{B}_X)$ be the object space with $\sigma$-algebra $\mathcal{B}_X$
and a probability measure $P_X$ defined on it. A function
$\phi : X \to \mathbb{R}$ is called a feature. A representation is a family of
features $\{\phi_t\}_{t \in T}$, where $(T, \mathcal{B}_T, \mu_T)$ is a measure
space and $\mu_T$ is $\sigma$-finite. Let $\phi_t$ also be bounded for all
$t \in T$, and let us denote $\phi_t(x)$ by $\phi(t, x)$, where $t \in T$ and
$x \in X$. We also require that, for all fixed $x$ and $y$ in $X$,

\[ G(x,y) = \int_T \phi(t,x)\phi(t,y)\,d\mu_T(t) < \infty. \tag{4} \]
Then the space $\mathcal{F}$, defined as the completion of the set of functions
$F$ of the form$^1$

\[ F(t) = \sum_{i=1}^N \alpha_i \phi(t, x_i), \tag{5} \]

where $\alpha_i \in \mathbb{R}$, $x_i \in X$, and $N \in \mathbb{N}$, is a
Hilbert space representation of the set $X$. Nevertheless, dealing explicitly
with such an $\mathcal{F}$ may be difficult if not impossible for practical
purposes. The following result gives an alternative way to deal with the
problem based on the bivariate function $G(x, y)$ defined above. Consider the
set of functions on $X$ of the form

\[ f(x) = \sum_{i=1}^N \alpha_i G(x, x_i), \tag{6} \]

where $\alpha_i \in \mathbb{R}$, $x_i \in X$, and $N \in \mathbb{N}$. Let us
define the inner product between elements $f = \sum_{i=1}^N \alpha_i G(x, x_i)$
and $g = \sum_{j=1}^M \beta_j G(x, x_j)$ of the above set as:

\[ \langle f, g \rangle = \sum_{i=1}^N \sum_{j=1}^M \alpha_i \beta_j G(x_i, x_j). \tag{7} \]

The completion of the above set is a Hilbert space $\mathcal{H}$ of functions
on $X$. Moreover, $\mathcal{H}$ is a reproducing kernel Hilbert space with
kernel $G$. Notice that for any finite set $\{x_i\}_{i=1}^N$ we have that

\[ \sum_{i=1}^N \sum_{j=1}^N \alpha_i \alpha_j G(x_i, x_j) \ge 0, \tag{8} \]

for all $\alpha \in \mathbb{R}^N$. Functions satisfying the above condition are
called positive definite.

$^1$ Even though it is not explicitly stated, we assume the construction of a
linear space of real functions with domain $T$.

Theorem 2.1 (Basic Congruence Theorem [13]): Let $H_1$ and $H_2$ be two
abstract Hilbert spaces. Let $X$ be an index set. Let $\{F(x), x \in X\}$ be a
family of vectors which span $H_1$. Similarly, let $\{f(x), x \in X\}$ be a
family of vectors which span $H_2$. Suppose that, for every $x$ and $y$ in $X$,

\[ \langle F(x), F(y) \rangle_1 = \langle f(x), f(y) \rangle_2. \tag{9} \]

Then the spaces $H_1$ and $H_2$ are congruent, and one can define a congruence
$\Psi$ from $H_1$ to $H_2$ which has the property that $\Psi(F(x)) = f(x)$ for
$x \in X$.

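Proposition 2.1 Let $X$ be a compact space. The spaces $\mathcal{F}$ and
$\mathcal{H}$ are congruent.

Proof 2.1 The congruence follows from the definition of $\mathcal{F}$ and
$\mathcal{H}$. For $F = \sum_{i=1}^n \alpha_i \phi(t, x_i)$ simply take
$\Psi : \mathcal{F} \to \mathcal{H}$ as:

\[ \Psi(F) = \sum_{i=1}^n \alpha_i \int_T \phi(t, x_i)\phi(t, \cdot)\,d\mu_T(t) = f. \]

□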

The above proposition allows us to perform the analysis of the representation
of $X$ on the equivalence classes that can be formed by using the function $G$
to define relations between the elements of $X$. From the congruence, we can
define a distance function between the representations of two elements
$x, y \in X$ using the function $G$ as follows:

\[ d^2(\phi(t,x), \phi(t,y)) = G(x,x) + G(y,y) - 2G(x,y); \tag{10} \]

for convenience we write $d^2(\phi(t, x), \phi(t, y))$ as $d^2(x, y)$.
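As a small illustration of (10), the sketch below computes the induced squared distances directly from a Gram matrix. The helper name `kernel_distance_sq` is hypothetical, and the linear kernel $G(x,y) = x^T y$ is chosen only because the result can then be checked against the ordinary squared Euclidean distance.

```python
import numpy as np

def kernel_distance_sq(G):
    """Squared distances d^2(x_i, x_j) = G_ii + G_jj - 2 G_ij from a Gram matrix G."""
    g = np.diag(G)
    return g[:, None] + g[None, :] - 2.0 * G

# With the linear kernel G = X X^T, (10) reduces to the squared Euclidean distance.
X = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
D2 = kernel_distance_sq(X @ X.T)
```

Only the values $G(x_i, x_j)$ are needed; the features $\phi(t, x)$ never have to be formed explicitly.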






2.2 The Cross-Information Potential RKHS 



For the set $\mathcal{F}$ of probability density functions that are square
integrable in $\mathbb{R}^n$, we can define the cross-information potential $V$
(CIP) as a bilinear form that maps densities $f_i, f_j \in \mathcal{F}$ to the
real numbers through the integral,

\[ V(f_i, f_j) = \int f_i(x) f_j(x)\,dx. \tag{11} \]

It is easy to see that for a basis of uniformly bounded, square integrable
probability density functions, $V$ defines a RKHS on $\mathrm{span}\{\mathcal{F}\}$
(up to completion). Now consider the set
$\mathcal{G} := \{g = \sum_{i=1}^m \alpha_i \kappa_\sigma(x_i, \cdot) \mid x_i \in \mathbb{R}^n, \sum_{i=1}^m \alpha_i = 1, \text{ and } \alpha_i > 0\}$,
where $\kappa_\sigma$ is a "Parzen" type of kernel, that is, $\kappa_\sigma$ is
symmetric, nonnegative, has bounded integral (can be normalized), and is shift
invariant with $\sigma$ as the scale parameter; $V$ also defines a RKHS
$\mathcal{K}$ on $\mathcal{G}$. Clearly, for any $g \in \mathcal{G}$, we have
$\|V(g, \cdot)\|_{\mathcal{K}} \le \|\kappa_\sigma(x, \cdot)\|$; therefore,
$\mathcal{K}$ is a space of functionals on a bounded, albeit non-compact set.
Notice that the cross-information potential, by definition, is a positive
definite function that is data dependent, and thus different from the
instance-based kernel representation in machine learning. Nevertheless, the
empirical estimator (3) links both Hilbert space representations. If we
construct the Gram matrix $K$ with elements $K_{ij} = \kappa_\sigma(x_i, x_j)$,
it is easy to verify that (3) corresponds to:

\[ \hat{H}_2(X) = -\log\left(\frac{1}{n^2}\mathrm{tr}(KK)\right). \tag{12} \]

As we can see, the information potential estimator can be related to the norm
of the Gram matrix $K$ defined as $\|K\|^2 = \mathrm{tr}(KK)$. Equation (12)
bears a lot of resemblance to well-known operational quantities from quantum
information theory, where the density matrix (operator) $\rho$ can be employed
to compute the expectation of an observable represented by the operator $A$ as
$\langle A \rangle = \mathrm{tr}(\rho A)$. For instance, Von Neumann's entropy
[14] corresponds to

\[ S(\rho) = -\mathrm{tr}(\rho \log \rho), \tag{13} \]

and quantum extensions of Renyi's entropy [15] are given by

\[ S_\alpha(\rho) = \frac{1}{1-\alpha} \log \mathrm{tr}(\rho^\alpha). \tag{14} \]

While some of the properties of (13) and (14) also apply to (12), we need to
point out that our approach to this functional is very different, since we deal
with the Gram matrices obtained from pairwise evaluations of a positive
definite kernel on a data sample. Consequently, our analysis will not only
involve the functional but also the kernels employed to construct $K$.
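The quantities in (12) and (13) are easy to evaluate once a matrix is available. The sketch below is illustrative only; the function names are hypothetical, natural logarithms are used, and the identity $\mathrm{tr}(KK) = \sum_{i,j} K_{ij}^2$ for symmetric $K$ is what ties (12) to the squared norm of the Gram matrix.

```python
import numpy as np

def entropy2_from_gram(K):
    """Second-order entropy estimate -log( tr(KK) / n^2 ), following (12)."""
    n = K.shape[0]
    return -np.log(np.trace(K @ K) / n ** 2)

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log rho), computed from the eigenvalues of rho."""
    lam = np.linalg.eigvalsh(rho)
    lam = lam[lam > 1e-12]  # 0 * log 0 is taken as 0
    return -np.sum(lam * np.log(lam))

# For a symmetric K, tr(KK) equals the sum of squared entries.
K = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.3],
              [0.1, 0.3, 1.0]])
frobenius_sq = np.sum(K ** 2)
trace_kk = np.trace(K @ K)
```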
3 Positive Definite Matrices and Renyi's Entropy Axioms

Hermitian matrices are considered as generalizations of real numbers. It is
possible to define a partial ordering on this set by using positive definite
matrices; for two Hermitian matrices $A, B \in M_n$, we say $A \succeq B$ if
$A - B$ is positive definite. Likewise, $A \succ B$ means that $A - B$ is
strictly positive definite. The following spectral decomposition theorem
relates to the functional calculus on matrices and provides a reasonable way to
extend continuous scalar-valued functions to Hermitian matrices.

Theorem 3.1 Let $D \subset \mathbb{C}$ be a given set and let
$\mathcal{N}_n(D) := \{A \in M_n : A \text{ is normal and } \sigma(A) \subset D\}$.
If $f(t)$ is a continuous scalar-valued function on $D$, then the primary
matrix function

\[ f(A) = U \begin{pmatrix} f(\lambda_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & f(\lambda_n) \end{pmatrix} U^* \tag{15} \]

is continuous on $\mathcal{N}_n(D)$, where $A = U \Lambda U^*$,
$\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, and $U \in M_n$ is
unitary.

Now we are ready to define a matrix analogue to Renyi's entropy that will
be applied to Gram matrices constructed using a positive definite function.

Consider the set $\Delta_n^+$ of positive definite matrices in $M_n$ for which
$\mathrm{tr}(A) \le 1$. It is clear that this set is closed under finite convex
combinations.

Proposition 3.1 Let $A \in \Delta_n^+$ and $B \in \Delta_n^+$, and also
$\mathrm{tr}(A) = \mathrm{tr}(B) = 1$. The functional

\[ S_\alpha(A) = \frac{1}{1-\alpha} \log_2\left(\mathrm{tr} A^\alpha\right) \tag{16} \]

satisfies the following set of conditions:

(i) $S_\alpha(PAP^*) = S_\alpha(A)$ for any orthonormal matrix $P \in M_n$.

(ii) $S_\alpha(pA)$ is a continuous function for $0 < p \le 1$.

(iii) $S_\alpha(\frac{1}{n}I) = \log_2 n$ (entropy is exhaustive).

(iv) $S_\alpha(A \otimes B) = S_\alpha(A) + S_\alpha(B)$.

(v) If $AB = BA = 0$, then for the strictly monotonic and continuous function
$g(x) = 2^{(\alpha-1)x}$ for $\alpha \neq 1$ and $\alpha > 0$, we have that:

\[ S_\alpha(tA + (1-t)B) = g^{-1}\big(t\,g(S_\alpha(A)) + (1-t)\,g(S_\alpha(B))\big). \tag{17} \]
Proof 3.1 The proof of (i) easily follows from Theorem 3.1. Take
$A = U \Lambda U^*$; now $PU$ is also a unitary matrix and thus
$\mathrm{tr} f(A) = \mathrm{tr} f(PAP^*)$, since the trace functional is
invariant under unitary transformations. For (ii), the proof reduces to the
continuity of $\frac{\alpha}{1-\alpha}\log_2(p)$. For (iii), a simple
calculation yields $\mathrm{tr} A^\alpha = \frac{1}{n^{\alpha-1}}$.

Now, for property (iv), notice that if $\mathrm{tr} A = \mathrm{tr} B = 1$,
then $\mathrm{tr}(A \otimes B) = 1$. Since $A = U \Lambda U^*$ and
$B = V \Gamma V^*$, we can write
$A \otimes B = (U \otimes V)(\Lambda \otimes \Gamma)(U \otimes V)^*$, from
which
$\mathrm{tr}(A \otimes B)^\alpha = \mathrm{tr}(\Lambda \otimes \Gamma)^\alpha = \mathrm{tr}(\Lambda^\alpha)\mathrm{tr}(\Gamma^\alpha)$,
and thus (iv) is proved.
Finally, for (v), notice that for any integer power $k$ of $tA + (1-t)B$ we
have $(tA + (1-t)B)^k = (tA)^k + ((1-t)B)^k$ since $AB = BA = 0$. Under extra
conditions such as $f(0) = 0$, the argument in the proof of Theorem 3.1 can be
extended to this case. Since the eigenspaces for the non-null eigenvalues of
$A$ and $B$ are orthogonal, we can simultaneously diagonalize $A$ and $B$ with
the orthonormal matrix $U$, that is, $A = U \Lambda U^*$ and
$B = U \Gamma U^*$, where $\Lambda$ and $\Gamma$ are diagonal matrices
containing the eigenvalues of $A$ and $B$ respectively. Since $AB = BA = 0$,
then $\Lambda\Gamma = 0$. Under the extra condition $f(0) = 0$, we have that
$f(tA + (1-t)B) = f(tA) + f((1-t)B)$, yielding the desired result for (17).

□
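The functional (16) and conditions (i), (iii), and (iv) of Proposition 3.1 can be checked numerically on small matrices. The following is a minimal sketch under the assumption that the input is a symmetric positive semidefinite matrix with unit trace; the name `matrix_renyi_entropy` and the example matrices are chosen here for illustration.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha), computed from eigenvalues."""
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]  # null eigenvalues do not contribute for alpha > 0
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

A = np.array([[0.6, 0.2], [0.2, 0.4]])  # unit trace, positive definite
B = np.array([[0.5, 0.1], [0.1, 0.5]])  # unit trace, positive definite

theta = 0.3
P = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta), np.cos(theta)]])  # an orthonormal matrix

alpha = 2.0
s_a = matrix_renyi_entropy(A, alpha)
s_rotated = matrix_renyi_entropy(P @ A @ P.T, alpha)         # condition (i)
s_uniform = matrix_renyi_entropy(np.eye(5) / 5.0, alpha)     # condition (iii)
s_kron = matrix_renyi_entropy(np.kron(A, B), alpha)          # condition (iv)
s_sum = s_a + matrix_renyi_entropy(B, alpha)
```

Working with the eigenvalues avoids forming the fractional matrix power $A^\alpha$ explicitly.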



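Notice also that if $\mathrm{rank}(A) = 1$, $S_\alpha(A) = 0$ for
$\alpha \neq 0$.
The following important property is also true.

Proposition 3.2 Let $A \in \Delta_n^+$ and $\mathrm{tr}(A) = 1$. For
$\alpha > 1$,

\[ S_\alpha(A) \le S_\alpha\left(\tfrac{1}{n}I\right). \tag{18} \]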



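Proof 3.2 Let $\{\lambda_i\}$ be the set of eigenvalues of $A$. Then we have
that,

\begin{align}
S_\alpha(A) - S_\alpha\left(\tfrac{1}{n}I\right)
 &= \frac{1}{1-\alpha}\log_2\left[\mathrm{tr}(AA^{\alpha-1})\right] - \log_2 n \tag{19}\\
 &= \frac{1}{1-\alpha}\log_2\left[\frac{\mathrm{tr}(AA^{\alpha-1})}{n^{-(\alpha-1)}}\right] \tag{20}\\
 &= \frac{1}{1-\alpha}\log_2\left[\sum_i \lambda_i (n\lambda_i)^{\alpha-1}\right] \tag{21}\\
 &\le \frac{1}{1-\alpha}\sum_i \lambda_i \log_2 (n\lambda_i)^{\alpha-1} \tag{22}\\
 &= \sum_i \lambda_i \log_2 \frac{1}{n\lambda_i} \tag{23}\\
 &\le \log_2\left[\sum_i \frac{\lambda_i}{n\lambda_i}\right] \le 0, \tag{24}
\end{align}

where (22) and (24) are due to Jensen's inequality.

□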
The above characterization applies to all unit-trace positive definite
matrices, but it also tells us that some matrices do not provide an information
theoretic interpretation. For instance, any unit-trace rank-one matrix will
have zero entropy even though it might contain information. Consider the matrix
$L = \frac{1}{n}\ell\ell^T$, where $n_1$ of the entries of $\ell$ are $1$ and
the remaining $n - n_1$ entries are $-1$. This can be seen as a vector that
represents each of the instances by their class label. Notice that evaluating
the entropy functional defined above yields $0$ for any $\alpha$. However, with
a two-column matrix $M$, for which the columns represent the class and the rows
the samples, and which encodes the class memberships as $M_{ij} = 1$ if the
$i$-th sample belongs to the $j$-th class and $0$ otherwise, we obtain a more
reasonable quantity using $L = \frac{1}{n}MM^T$, related to a binomial
distribution with $p = \frac{n_1}{n}$. Interestingly, we can simply relate
$\ell$ and $M$ by $\ell = M(1, -1)^T$. As we can see from the example given
above, the functional alone does not fully characterize the problem. In the
following sections, we will address this issue by considering a particular
context in which Hadamard products between positive definite matrices arise.
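The label example above can be reproduced with a few lines of code. This sketch assumes four samples with three in the first class (so $n_1/n = 3/4$) and scales both matrices by $1/n$ so that they have unit trace; the variable names are hypothetical.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha):
    """S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha), computed from eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

labels = np.array([1.0, 1.0, 1.0, -1.0])  # n = 4 samples, n_1 = 3 in class one
n = labels.size

# Rank-one matrix built from the +1/-1 label vector: zero entropy.
L_rank_one = np.outer(labels, labels) / n

# Two-column membership matrix M: one indicator column per class.
M = np.stack([labels == 1.0, labels == -1.0], axis=1).astype(float)
L_membership = (M @ M.T) / n

s_rank_one = matrix_renyi_entropy(L_rank_one, 2.0)
s_membership = matrix_renyi_entropy(L_membership, 2.0)

# Renyi entropy of order 2 of a two-outcome distribution with p = n_1 / n.
p = 3.0 / 4.0
s_two_outcomes = -np.log2(p ** 2 + (1.0 - p) ** 2)
```

The nonzero eigenvalues of $\frac{1}{n}MM^T$ are the class proportions, which is why the second value matches the entropy of a two-outcome distribution.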

3.1 Entropy inequalities for Hadamard Products 

In Propositions 3.1 and 3.2 we did not consider Hadamard products of positive
definite matrices. This product may be of interest in the case we have two
matrices $A$ and $B$ in $\Delta_n^+$ with unit trace where there exists some
relation between the elements $A_{ij}$ and $B_{ij}$ for all $i$ and $j$. The
Hadamard product can be useful in developing analogues to joint entropies,
where each one of the matrices involved in the Hadamard product represents a
random variable. Before we present the main result of the section, we need to
introduce the concept of majorization and some results pertaining to the
ordering that arises from this definition.

Definition 3.1 (Majorization): Let $p$ and $q$ be two nonnegative vectors in
$\mathbb{R}^n$ such that $\sum_{i=1}^n p_i = \sum_{i=1}^n q_i < \infty$. We say
$p \preceq q$, $q$ majorizes $p$, if their respective ordered sequences
$p_{[1]} \ge p_{[2]} \ge \cdots \ge p_{[n]}$ and
$q_{[1]} \ge q_{[2]} \ge \cdots \ge q_{[n]}$, denoted by
$\{p_{[i]}\}_{i=1}^n$ and $\{q_{[i]}\}_{i=1}^n$, satisfy:

\[ \sum_{i=1}^k p_{[i]} \le \sum_{i=1}^k q_{[i]} \quad \text{for } k = 1, \dots, n. \tag{25} \]

It can be shown that if $p \preceq q$ then $p = Aq$ for some doubly stochastic
matrix $A$ [15]. It is also easy to verify that if $p \preceq q$ and
$p \preceq h$ then $p \preceq tq + (1-t)h$ for $t \in [0, 1]$. The majorization
order is important because it can be associated with the definition of
Schur-concave (convex) functions. A real-valued function $f$ on $\mathbb{R}^n$
is called Schur-convex if $p \preceq q$ implies $f(p) \le f(q)$, and
Schur-concave if $f(q) \le f(p)$.
Lemma 3.1 The function $f_\alpha : S_n \to \mathbb{R}_+$ ($S_n$ denotes the
$n$-dimensional simplex), defined as,

\[ f_\alpha(p) = \frac{1}{1-\alpha} \log_2 \sum_{i=1}^n p_i^\alpha, \tag{26} \]

is Schur-concave for $\alpha > 0$.
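Definition 3.1 and Lemma 3.1 can be illustrated on a small example. In the sketch below, $p$ is obtained from $q$ through a doubly stochastic matrix, so that $p \preceq q$, and the function in (26) is then evaluated on both vectors. The helper names and the particular matrix are assumptions made for this illustration.

```python
import numpy as np

def is_majorized_by(p, q, tol=1e-12):
    """Return True if p is majorized by q in the sense of (25)."""
    p = np.sort(np.asarray(p, dtype=float))[::-1]  # decreasing order
    q = np.sort(np.asarray(q, dtype=float))[::-1]
    if abs(p.sum() - q.sum()) > tol:
        return False
    return bool(np.all(np.cumsum(p) <= np.cumsum(q) + tol))

def renyi_entropy(p, alpha):
    """f_alpha(p) = log2(sum_i p_i^alpha) / (1 - alpha), as in (26)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

q = np.array([0.7, 0.2, 0.1])
# Rows and columns of this matrix each sum to one (doubly stochastic).
A = np.array([[0.5, 0.50, 0.00],
              [0.5, 0.25, 0.25],
              [0.0, 0.25, 0.75]])
p = A @ q  # the averaged vector, which q majorizes

p_below_q = is_majorized_by(p, q)
q_below_p = is_majorized_by(q, p)
```

Schur-concavity then predicts `renyi_entropy(p, alpha) >= renyi_entropy(q, alpha)` for the tested orders.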

Notice that Schur-concavity (Schur-convexity) should not be confused with
concavity (convexity) of a function in the usual sense. Now, we are ready to
state the inequality for Hadamard products.

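Proposition 3.3 Let $A$ and $B$ be two $n \times n$ positive definite matrices
with trace 1 and nonnegative entries, and $A_{ii} = \frac{1}{n}$ for
$i = 1, 2, \dots, n$. Then, the following inequalities hold:

(i)

\[ S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \ge S_\alpha(B), \tag{27} \]

(ii) and

\[ S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \le S_\alpha(A) + S_\alpha(B). \tag{28} \]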

Proof 3.3 In proving (27) and (28), we will use the fact that $S_\alpha$
preserves the (inverse) majorization order of nonnegative sequences on the
$n$-dimensional simplex. First consider the identity

\[ x^T (A \circ B) x = \mathrm{tr}(A D_x B D_x), \]

where $D_x = \mathrm{diag}(x)$. In particular, if $\{x_i\}_{i=1}^n$ is an
orthonormal basis for $\mathbb{R}^n$,

\[ \mathrm{tr}(A \circ B) = \sum_{i=1}^n x_i^T (A \circ B) x_i = \frac{1}{n}. \]

If we let $\{x_i\}_{i=1}^n$ be the eigenvectors of $A \circ B$ ordered
according to their respective eigenvalues in decreasing order, then,

\[ \sum_{i=1}^k x_i^T (A \circ B) x_i = \sum_{i=1}^k \mathrm{tr}(A D_{x_i} B D_{x_i}) \le \cdots \le \frac{1}{n}\sum_{i=1}^k y_i^T B y_i, \tag{29} \]
where $k = 1, \dots, n$ and $\{y_i\}_{i=1}^n$ are the eigenvectors of $B$
ordered according to their respective eigenvalues in decreasing order. The
inequality (29) is equivalent to $n\lambda(A \circ B) \preceq \lambda(B)$, that
is, the sequence of eigenvalues of $(A \circ B)/\mathrm{tr}(A \circ B)$ is
majorized by the sequence of eigenvalues of $B$, which implies (27).
To prove (28), notice that for $A$ we have two extreme cases,
$A = \frac{1}{n}I$ and $A = \frac{1}{n}\mathbf{1}\mathbf{1}^T$. Taking
$A = \frac{1}{n}\mathbf{1}\mathbf{1}^T$ we have that

\[ \sum_{i=1}^k \lambda_i(B) = n\sum_{i=1}^k \mathrm{tr}(A D_{x_i} B D_{x_i}) = \sum_{i=1}^k \lambda_i\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right); \tag{30} \]

in the other extreme case, where $A = \frac{1}{n}I$, we have

\[ \frac{1}{n}\sum_{i=1}^k \lambda_i(B) \le \sum_{i=1}^k d_i(B) = \sum_{i=1}^k \lambda_i\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right), \tag{31} \]

where $\{\lambda_i(X)\}$ are the eigenvalues of $X$ in decreasing order and
$\{d_i(X)\}$ are the elements of the diagonal of $X$ ordered in decreasing
order. The inequalities (30) and (31) imply (28).



□ 

3.2 The Tensor and Hadamard Product Entropy Gap 

The mutual information of a pair of random variables $X$ and $Y$ represents the
reduction in uncertainty when going from knowledge of the marginal
distributions to full knowledge of the joint distribution. In the Shannon
definition this information gain can be expressed as:

\[ I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{32} \]

where $H(X)$ and $H(Y)$ are the marginal entropies of $X$ and $Y$, and
$H(X, Y)$ is their joint entropy. In analogy we can compute the quantity:

\[ I_\alpha(A; B) = S_\alpha(A) + S_\alpha(B) - S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \tag{33} \]

for positive semidefinite $A$ and $B$ with nonnegative entries and unit trace
such that $A_{ii} = \frac{1}{n}$ for all $i = 1, \dots, n$. Notice that the
above quantity is nonnegative and satisfies

\[ S_\alpha(A) \ge I_\alpha(A; A). \]

3.3 The Single and Hadamard Product Entropy Gap 

Another quantity of interest is the conditional entropy of $X$ given $Y$, which
can be understood as the uncertainty about $X$ that remains after knowing the
joint distribution of $X$ and $Y$. In Shannon's definition the conditional
entropy $H(X|Y)$ can be expressed as:

\[ H(X|Y) = H(X, Y) - H(Y). \tag{34} \]

Extending this idea to the matrix case yields$^2$:

\[ H_\alpha(A|B) = S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) - S_\alpha(B) \tag{35} \]

for positive semidefinite $A$ and $B$ with nonnegative entries and unit trace
such that $A_{ii} = \frac{1}{n}$ for all $i = 1, \dots, n$. The above quantity
is nonnegative and upper bounded by $S_\alpha(A)$. In the following section, we
will see how infinitely divisible matrices relate Hadamard products with the
concatenation of representations of the variables we want to analyze jointly.
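The matrix analogues (33) and (35) can be sketched as follows. The function names are hypothetical, and the two test cases use the extreme matrices $\frac{1}{n}I$ and $\frac{1}{n}\mathbf{1}\mathbf{1}^T$, for which the values can be worked out by hand; the matrix `B` is an arbitrary unit-trace example with nonnegative entries.

```python
import numpy as np

def matrix_renyi_entropy(A, alpha=2.0):
    """S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha), computed from eigenvalues."""
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_matrix(A, B):
    """Normalized Hadamard product (A o B) / tr(A o B)."""
    AB = A * B  # element-wise (Hadamard) product
    return AB / np.trace(AB)

def mutual_information(A, B, alpha=2.0):
    """I_alpha(A; B) as in (33)."""
    return (matrix_renyi_entropy(A, alpha) + matrix_renyi_entropy(B, alpha)
            - matrix_renyi_entropy(joint_matrix(A, B), alpha))

def conditional_entropy(A, B, alpha=2.0):
    """H_alpha(A | B) as in (35)."""
    return (matrix_renyi_entropy(joint_matrix(A, B), alpha)
            - matrix_renyi_entropy(B, alpha))

n = 4
identity = np.eye(n) / n         # every sample distinct from all others
constant = np.ones((n, n)) / n   # every sample identical (rank one)
B = np.array([[0.25, 0.10, 0.00, 0.00],
              [0.10, 0.25, 0.00, 0.00],
              [0.00, 0.00, 0.25, 0.05],
              [0.00, 0.00, 0.05, 0.25]])

i_self = mutual_information(identity, identity)   # equals log2(n)
i_const = mutual_information(constant, B)          # a constant carries no information
h_b_given_const = conditional_entropy(B, constant)
```

In practice `A` and `B` would be Gram matrices of two variables evaluated on the same sample and scaled to unit trace.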

4 Infinitely Divisible Functions 

4.1 Direct-Sum and Product kernels 
4.1.1 Direct-Sum kernels 

Let $\kappa_1$ and $\kappa_2$ be two positive definite kernels defined on
$X \times X$. The kernel $\kappa_\oplus = \kappa_1 + \kappa_2$, defined as
$\kappa_\oplus(x, y) = \kappa_1(x,y) + \kappa_2(x,y)$, is a positive definite
kernel. The above function is called the direct sum kernel and it is the
reproducing kernel of a space $\mathcal{H}_\oplus$ of functions of the form
$f = f_1 + f_2$, where $f_1 \in \mathcal{H}_1$ and $f_2 \in \mathcal{H}_2$, and
$\mathcal{H}_1$ and $\mathcal{H}_2$ are the RKHSs defined by $\kappa_1$ and
$\kappa_2$, respectively. Consider the Hilbert space
$\mathcal{H} = \mathcal{H}_1 \times \mathcal{H}_2$ formed by all pairs
$(f_1, f_2)$ coming from $\mathcal{H}_1$ and $\mathcal{H}_2$, respectively. It
is possible that some functions $f \neq 0$ belong to both $\mathcal{H}_1$ and
$\mathcal{H}_2$ at the same time. These functions form a set of pairs
$(f, -f) \in \mathcal{H}$, which turns out to be a closed subspace of
$\mathcal{H}$ denoted by $\mathcal{H}_0$, such that
$\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_0^\perp$. Therefore, the linear
correspondence $f(x) = f_1(x) + f_2(x)$ between $f \in \mathcal{H}_\oplus$ and
$(f_1, f_2) \in \mathcal{H}$ is such that all elements in $\mathcal{H}_0$ map
to the zero function in $\mathcal{H}_\oplus$, and the elements of
$\mathcal{H}_\oplus$ and $\mathcal{H}_0^\perp$ are in one-to-one
correspondence. The norm of $f \in \mathcal{H}_\oplus$ can be defined from the
correspondence $(g_1(f), g_2(f))$ as:

\[ \|f\|^2_{\kappa_\oplus} = \|(g_1(f), g_2(f))\|^2_{\mathcal{H}} = \|g_1(f)\|^2_{\mathcal{H}_1} + \|g_2(f)\|^2_{\mathcal{H}_2}. \tag{36} \]

Notice that $(g_1(f), g_2(f))$ is the decomposition of $f$ into the pair in
$\mathcal{H}$ with minimum norm in this space. The following theorem states the
result [18].

Theorem 4.1 If $\kappa_i(x,y)$ is the reproducing kernel of the class
$\mathcal{H}_i$, with norm $\|\cdot\|_i$, then
$\kappa(x,y) = \kappa_1(x,y) + \kappa_2(x,y)$ is the reproducing kernel of the
class of all functions $f = f_1 + f_2$ with $f_i \in \mathcal{H}_i$, and with
the norm defined by

\[ \|f\|^2 = \min\{\|f_1\|_1^2 + \|f_2\|_2^2\}. \tag{37} \]

The minimum is taken over all decompositions $f = f_1 + f_2$ with
$f_i \in \mathcal{H}_i$.

$^2$ There is no consensus about what should be the definition of conditional
entropy in Renyi's case (see [17]).

4.1.2 Product Kernel and Tensor Product Spaces 

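Consider two positive definite kernels $\kappa_1$ and $\kappa_2$ defined on
$X \times X$ and $Y \times Y$, respectively. Their tensor product
$\kappa_1 \otimes \kappa_2 : (X \times Y) \times (X \times Y) \to \mathbb{R}$,
defined by:

\[ \kappa_1 \otimes \kappa_2((x_i, y_i), (x_j, y_j)) = \kappa_1(x_i, x_j)\kappa_2(y_i, y_j), \tag{38} \]

is also a positive definite kernel. Note that we can consider two kernels
$\tilde{\kappa}_1$ and $\tilde{\kappa}_2$, both defined on
$(X \times Y) \times (X \times Y)$, such that
$\tilde{\kappa}_1((x_i, y_i), (x_j, y_j)) = \kappa_1(x_i, x_j)$ and
$\tilde{\kappa}_2((x_i, y_i), (x_j, y_j)) = \kappa_2(y_i, y_j)$; the kernel

\[ \tilde{\kappa}_1 \cdot \tilde{\kappa}_2((x_i, y_i), (x_j, y_j)) = \tilde{\kappa}_1((x_i, y_i), (x_j, y_j))\,\tilde{\kappa}_2((x_i, y_i), (x_j, y_j)) = \kappa_1 \otimes \kappa_2((x_i, y_i), (x_j, y_j)) \]

is positive definite by the Schur Theorem. Let us look at the space of
functions that $\kappa_\otimes = \kappa_1 \otimes \kappa_2$ spans. Let
$\mathcal{H}_\otimes = \mathcal{H}_1 \otimes \mathcal{H}_2$, where
$\mathcal{H}_1$ and $\mathcal{H}_2$ are the RKHSs spanned by $\kappa_1$ and
$\kappa_2$, respectively. The space $\mathcal{H}_\otimes$ is the completion of
the space of all functions $f$ on $X \times Y$ of the form:

\[ f(x,y) = \sum_{i=1}^n f_1^{(i)}(x) f_2^{(i)}(y), \tag{39} \]

with $f_1^{(i)} \in \mathcal{H}_1$ and $f_2^{(i)} \in \mathcal{H}_2$, and inner
product,

\[ \langle f, g \rangle_\otimes = \sum_{i=1}^n \sum_{j=1}^m \langle f_1^{(i)}, g_1^{(j)} \rangle_1 \langle f_2^{(i)}, g_2^{(j)} \rangle_2. \tag{40} \]

The functions $f$ and $g$ may have multiple representations of the form (39)
without changing $\langle f, g \rangle_\otimes$. Let us look at the case where
$X$ and $Y$ are the same set. The following theorem describes the kernel
derived from the restriction of $\kappa_1 \otimes \kappa_2$ to the diagonal
subset of $X \times X$ [18].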

Theorem 4.2 For $x, y \in X$, the kernel
$\kappa(x,y) = \kappa_1(x,y)\kappa_2(x,y)$ is the reproducing kernel of the
class $\mathcal{H}$ of the restrictions of the direct product
$\mathcal{H}_\otimes = \mathcal{H}_1 \otimes \mathcal{H}_2$ to the diagonal set
formed by all elements $(x, x) \in X \times X$. For any such restriction $f$,
$\|f\| = \min \|g\|_\otimes$ for all $g \in \mathcal{H}_\otimes$ such that
$f(x) = g(x, x)$.
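On a finite sample of pairs $(x_i, y_i)$, the product kernel (38) corresponds to the Hadamard product of the two Gram matrices, and the direct sum kernel to their sum. The sketch below checks this with Gaussian kernels, chosen here as an assumption because the product of two Gaussian kernels of equal width is again a Gaussian kernel on the concatenated vectors; the helper name `gaussian_gram` and the random sample are illustrative.

```python
import numpy as np

def gaussian_gram(x, sigma):
    """Gram matrix of exp(-||x_i - x_j||^2 / (2 sigma^2)) for the rows of x."""
    x = np.asarray(x, dtype=float)
    sq = np.sum(x ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))
y = rng.normal(size=(6, 1))

Kx = gaussian_gram(x, 1.0)
Ky = gaussian_gram(y, 1.0)

K_product = Kx * Ky                                  # Hadamard product of Gram matrices
K_concat = gaussian_gram(np.hstack([x, y]), 1.0)     # kernel on concatenated samples
K_sum = Kx + Ky                                      # Gram matrix of the direct sum kernel

min_eig_product = np.linalg.eigvalsh(K_product).min()
min_eig_sum = np.linalg.eigvalsh(K_sum).min()
```

Both smallest eigenvalues should be nonnegative up to round-off, in agreement with the Schur theorem and with the positive definiteness of the direct sum kernel.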

4.2 Negative Definite Functions and Infinitely Divisible Matrices

4.2.1 Negative Definite Functions and Hilbertian Metrics 

Let $\mathcal{M} = (X, d)$ be a separable metric space. A necessary and
sufficient condition for $\mathcal{M}$ to be embeddable in a Hilbert space
$\mathcal{H}$ is that for any set $\{x_i\} \subset X$ of $n + 1$ points, the
following inequality holds:

\[ \sum_{i,j=1}^n \alpha_i \alpha_j \left(d^2(x_0, x_i) + d^2(x_0, x_j) - d^2(x_i, x_j)\right) \ge 0, \tag{41} \]

for any $\alpha \in \mathbb{R}^n$. This condition is equivalent to

\[ \sum_{i,j=0}^n \alpha_i \alpha_j d^2(x_i, x_j) \le 0, \tag{42} \]

for any $\alpha \in \mathbb{R}^{n+1}$ such that $\sum_{i=0}^n \alpha_i = 0$.
This condition is known as negative definiteness. Interestingly, the above
condition implies that $\exp(-r d^2(x_i, x_j))$ is positive definite in $X$ for
all $r > 0$ [19]. Indeed, matrices derived from functions satisfying the above
property form a special class of matrices known as infinitely divisible.
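Condition (42) and the positive definiteness of $\exp(-r d^2)$ can be checked numerically for the squared Euclidean distance, which is used here only as a familiar example of a negative definite function. The sample, the weights, and the values of $r$ are arbitrary choices for this sketch.

```python
import numpy as np

def squared_distances(x):
    """Matrix of squared Euclidean distances between the rows of x."""
    sq = np.sum(x ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))
D2 = squared_distances(x)

# Weights that sum to zero, as required in (42).
a = rng.normal(size=8)
a -= a.mean()
quadratic_form = a @ D2 @ a  # should be <= 0

# exp(-r d^2) should give a positive semidefinite matrix for every r > 0.
min_eigs = [np.linalg.eigvalsh(np.exp(-r * D2)).min() for r in (0.1, 1.0, 10.0)]
```

For the Euclidean case the quadratic form equals $-2\|\sum_i \alpha_i x_i\|^2$, which makes the sign easy to verify by hand.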

4.2.2 Infinitely Divisible Matrices

According to the Schur product theorem, $A \succeq 0$ implies that
$A^{\circ n} = A \circ A \circ \cdots \circ A \succeq 0$ for any positive
integer $n$. An interesting question is when the above holds if one were to
take fractional powers of $A$, that is, when $A^{\circ \frac{1}{m}} \succeq 0$
for any positive integer $m$. This leads to the concept of infinitely divisible
matrices [20, 21].

Definition 4.1 Suppose that $A \succeq 0$ and $a_{ij} \ge 0$ for all $i$ and
$j$. $A$ is said to be infinitely divisible if $A^{\circ r} \succeq 0$ for
every nonnegative $r$.

Infinitely divisible matrices are intimately related to negative definiteness,
as we can see from the following proposition.

Proposition 4.1 If $A$ is infinitely divisible, then the matrix
$B_{ij} = -\log A_{ij}$ is negative definite.

From this fact it is possible to relate infinitely divisible matrices with
isometric embedding into Hilbert spaces. If we construct the matrix,

\[ D_{ij} = B_{ij} - \frac{1}{2}(B_{ii} + B_{jj}), \tag{43} \]

using the matrix $B$ from Proposition 4.1, there exist a Hilbert space
$\mathcal{H}$ and a mapping $\phi$ such that

\[ D_{ij} = \|\phi(i) - \phi(j)\|^2_{\mathcal{H}}. \tag{44} \]

Moreover, notice that if $A$ is positive definite, $-A$ is negative definite
and $\exp(A_{ij})$ is infinitely divisible. In a similar way, we can construct
a matrix,

\[ D_{ij} = -A_{ij} + \frac{1}{2}(A_{ii} + A_{jj}), \tag{45} \]

with the same property (44). This relation between (43) and (45) suggests a
normalization of infinitely divisible matrices with non-zero diagonal elements
that can be formalized in the following theorem.
Theorem 4.3 Let $X$ be a nonempty set and $d_1$ and $d_2$ two metrics defined
on it, such that for any set $\{x_i\}_{i=1}^n$,

\[ \sum_{i,j=1}^n \alpha_i \alpha_j d_\ell^2(x_i, x_j) \le 0, \tag{46} \]

for any $\alpha \in \mathbb{R}^n$ with $\sum_{i=1}^n \alpha_i = 0$, is true for
$\ell = 1, 2$. Consider the matrices
$A^{(\ell)}_{ij} = \exp(-d_\ell^2(x_i, x_j))$ and their normalizations defined
by:

\[ \hat{A}^{(\ell)}_{ij} = \frac{A^{(\ell)}_{ij}}{\sqrt{A^{(\ell)}_{ii} A^{(\ell)}_{jj}}}. \tag{47} \]

Then, if $\hat{A}^{(1)} = \hat{A}^{(2)}$ for any finite set
$\{x_i\}_{i=1}^n \subset X$, there exist isometrically isomorphic Hilbert
spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ that contain the Hilbert space
embeddings of the metric spaces $(X, d_\ell)$, $\ell = 1, 2$. Moreover,
$\hat{A}^{(\ell)}$ are infinitely divisible.

Notice that the normalization procedure for infinitely divisible matrices
proposed in Theorem 4.3 can be justified as yielding the maximum entropy matrix
among all matrices for which the Hilbert space embeddings are isometrically
isomorphic.
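Definition 4.1, Proposition 4.1, and the construction (43) can be illustrated with a scaled Gaussian Gram matrix. The choice $A_{ij} = 2\exp(-\|x_i - x_j\|^2)$ is an assumption made for this sketch so that the diagonal is not equal to one and the correction term in (43) is visible.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(7, 2))
sq = np.sum(x ** 2, axis=1)
D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)

A = 2.0 * np.exp(-D2)  # nonnegative entries, diagonal equal to 2

# Element-wise (Hadamard) fractional powers should stay positive semidefinite.
min_eigs = [np.linalg.eigvalsh(A ** r).min() for r in (0.1, 0.5, 1.5)]

# B = -log A entry-wise, then the matrix D of (43).
B = -np.log(A)
b = np.diag(B)
D = B - 0.5 * (b[:, None] + b[None, :])
```

For this example, `D` recovers the squared Euclidean distances between the points, which is the embedding property stated in (44).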



5 Statistical Properties of Gram Matrices and 
their connection with ITL 

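Let $(\mathcal{X}, \mathcal{B}_X, P_X)$ be a countably generated measure space. Let $\kappa : \mathcal{X} \times \mathcal{X} \mapsto \mathbb{R}$
be a reproducing kernel and $\phi : \mathcal{X} \mapsto \mathcal{H}$ the mapping such that $\kappa(x, y) =
\langle \phi(x), \phi(y) \rangle$, and:

$$E_X[\kappa(X, X)] = E_X\left[\|\phi(X)\|^2\right] = \int_{\mathcal{X}} \langle \phi(x), \phi(x) \rangle \, dP_X(x) = 1. \qquad (48)$$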

Since $E_X\left[\|\phi(X)\|^2\right] < \infty$ we can define an operator $G : \mathcal{H} \mapsto \mathcal{H}$ through the
following bilinear form$^3$:

$$\mathcal{G}(f, g) = \langle f, G g \rangle = \int_{\mathcal{X}} \langle f, \phi(x) \rangle \langle \phi(x), g \rangle \, dP_X(x). \qquad (49)$$

$^3$ Notice that $f \in \mathcal{H} \Rightarrow f \in L_2(P_X)$. First, $|f(x)| = |\langle f, \phi(x) \rangle| \leq \|f\| \kappa(x, x)^{1/2}$, and thus
$f(x)^2 \leq \|f\|^2 \kappa(x, x)$. Since $\int \kappa(x, x) \, dP_X = 1$, we have $\|f\|_{L_2}^2 = \int f^2 \, dP_X \leq \|f\|^2$.

Notice that $f$ and $g$ belong to $\mathcal{H}$, and from the reproducing property of $\kappa$ we
have that $f(x) = \langle f, \phi(x) \rangle$ and thus $\mathcal{G}(f, g) = E_X[f(X) g(X)]$. From the
normalization condition (48) we have that:



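$$\operatorname{tr}(G) = \sum_{i=1}^{N_{\mathcal{H}}} \mathcal{G}(\psi_i, \psi_i) = \sum_{i=1}^{N_{\mathcal{H}}} \int_{\mathcal{X}} \langle \psi_i, \phi(x) \rangle \langle \phi(x), \psi_i \rangle \, dP_X(x) = 1, \qquad (50)$$

where $\{\psi_i\}_{i=1}^{N_{\mathcal{H}}}$ is a complete orthonormal basis for $\mathcal{H}$, and thus $G$ is trace class.

5.1 The trace of $G^\alpha$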

In the definition of the entropy-like quantity for positive definite matrices, we
employ functional calculus using the spectral theorem to compute $\operatorname{tr}(A^\alpha)$. In
particular, we consider the Gram matrix $K$ constructed by all pairwise evaluations
of a normalized infinitely divisible kernel $\kappa$ and scale it by $\frac{1}{N}$ such that
$\frac{1}{N} \sum_{i=1}^{N} \kappa(x_i, x_i) = 1$. The above scaling can be thought of as normalizing the
kernel such that for the empirical distribution $P_N$,

$$E_{\mathrm{emp}}[\kappa(X, X)] = E_{\mathrm{emp}}\left[\|\phi(X)\|^2\right] = \int_{\mathcal{X}} \langle \phi(x), \phi(x) \rangle \, dP_N(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa(x_i, x_i) = 1. \qquad (51)$$

It follows immediately from Proposition 5.1 that $\operatorname{tr}(G_N^\alpha) = \operatorname{tr}\left(\left(\frac{1}{N} K\right)^\alpha\right)$, where $G_N$ is
the empirical version of $G$ defined in (55). As we have seen, $G$ defines a bilinear form
$\mathcal{G}$ that coincides with the correlation of functions on $\mathcal{X}$ that belong to the RKHS
induced by $\kappa$. Let us look at the case $\alpha = 2$, which is the initial motivation of this
study and has been extensively treated in ITL in relation to plug-in estimators of
Rényi's entropy. This case is also important since there are interesting links with
maximum discrepancy and Hilbert-Schmidt norms. In the limit case we have:
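$$\begin{aligned}
\operatorname{tr}(G^2) &= \sum_{i=1}^{N_{\mathcal{H}}} \langle \psi_i, G^2 \psi_i \rangle = \sum_{i=1}^{N_{\mathcal{H}}} \langle G \psi_i, G \psi_i \rangle = \sum_{i=1}^{N_{\mathcal{H}}} \|G \psi_i\|^2 = \|G\|_{HS}^2 \\
&= \sum_{i=1}^{N_{\mathcal{H}}} \int_{\mathcal{X}} \int_{\mathcal{X}} \langle \psi_i, \phi(x) \rangle \langle \phi(x), \phi(y) \rangle \langle \phi(y), \psi_i \rangle \, dP_X(x) \, dP_X(y) \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} \langle \phi(x), \phi(y) \rangle \sum_{i=1}^{N_{\mathcal{H}}} \langle \phi(y), \psi_i \rangle \langle \psi_i, \phi(x) \rangle \, dP_X(x) \, dP_X(y) \\
&= \int_{\mathcal{X}} \int_{\mathcal{X}} \langle \phi(x), \phi(y) \rangle^2 \, dP_X(x) \, dP_X(y) \\
&= \|\mu_X\|_{\kappa^2}^2, \qquad (52)
\end{aligned}$$

where $\|\mu_X\|_{\kappa^2}^2$ denotes the squared norm of the image of the mapping $P_X \mapsto \mu_X$ in the
RKHS $\mathcal{K}$ induced by the kernel $\kappa^2(x, y) = \kappa(x, y) \kappa(x, y)$. In the more general
case of any $\alpha > 1$ we have,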

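$$\begin{aligned}
\operatorname{tr}(G^\alpha) &= \sum_{i=1}^{N_{\mathcal{H}}} \langle \psi_i, G^\alpha \psi_i \rangle = \sum_{i=1}^{N_{\mathcal{H}}} \langle G \psi_i, G^{\alpha - 1} \psi_i \rangle \\
&= \sum_{i=1}^{N_{\mathcal{H}}} \int_{\mathcal{X}} \langle \psi_i, \phi(x) \rangle \langle \phi(x), G^{\alpha - 1} \psi_i \rangle \, dP_X(x) \\
&= \int_{\mathcal{X}} \langle \phi(x), G^{\alpha - 1} \phi(x) \rangle \, dP_X(x) \\
&= \int_{\mathcal{X}} h(x, x) \, dP_X(x); \qquad (53)
\end{aligned}$$

notice that $h(x, y)$ itself is a positive definite function on $\mathcal{X} \times \mathcal{X}$ that also
depends on $P_X$.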
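For a finite sample, the quantities in this subsection reduce to sums over the eigenvalues of the scaled Gram matrix. The following sketch, with hypothetical helper names and an illustrative Gaussian kernel, computes $\operatorname{tr}((K/N)^\alpha)$ from the spectrum and checks two consequences of the text: the trace is one for $\alpha = 1$ under the scaling in (51), and for $\alpha = 2$ it equals $\frac{1}{N^2}\sum_{i,j}\kappa^2(x_i, x_j)$, the empirical counterpart of (52).

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gaussian Gram matrix with unit diagonal (illustrative kernel choice)."""
    sq = np.sum(x**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def trace_power(K, alpha):
    """tr((K/N)^alpha) computed from the eigenvalues of the scaled Gram matrix."""
    lam = np.linalg.eigvalsh(K / K.shape[0])
    lam = np.clip(lam, 0.0, None)          # remove tiny negative round-off values
    return float(np.sum(lam**alpha))

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 2))               # hypothetical sample
K = gaussian_gram(x)

t1 = trace_power(K, 1.0)                   # equals 1 under the scaling in (51)
t2 = trace_power(K, 2.0)                   # equals the mean of squared kernel values
t2_direct = float(np.mean(K**2))

print(t1, t2, t2_direct)
```

Working with eigenvalues rather than matrix powers also covers non-integer orders such as $\alpha = 1.01$.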

Figure 1 summarizes the relation between spaces that are considered in the
proposed framework. The object space $\mathcal{X}$ can be directly mapped into $\mathcal{H}_\kappa$
using an infinitely divisible kernel $\kappa$, or it can be mapped to a Hilbert space
$\mathcal{H}_d$, if a negative definite function $d$ is employed as the distance function. The
spaces $\mathcal{H}_\kappa$ and $\mathcal{H}_d$ are related by the log and exp functions.

5.2 The Spectrum of G and Consistency of its Estimator

By definition, the bilinear form $\mathcal{G}$ is a positive definite kernel
in $\mathcal{H}$ since

$$\sum_{i,j=1}^{N} \alpha_i \alpha_j \mathcal{G}(f_i, f_j) \geq 0 \qquad (54)$$

for any finite set $\{f_i\}_{i=1}^{N} \subseteq \mathcal{H}$. Notice from (49) that $\mathcal{G}$ is symmetric and thus $G$ is
self-adjoint. Moreover, since $\mathcal{G}$ is positive definite, it can be shown that $G$ is
a positive definite operator. Instead of dealing directly with the spectrum of
$G$, for which we would need to know the probability measure $P_X$, we are going to look
at the spectrum of $G_N$ and the convergence properties of this operator. Based
on the empirical distribution $P_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}(x)$, the empirical version $G_N$ of $G$
obtained from a sample $\{x_i\}$ of size $N$ is given by:

$$\langle f, G_N g \rangle = \mathcal{G}_N(f, g) = \int_{\mathcal{X}} \langle f, \phi(x) \rangle \langle \phi(x), g \rangle \, dP_N(x) = \frac{1}{N} \sum_{i=1}^{N} \langle f, \phi(x_i) \rangle \langle \phi(x_i), g \rangle. \qquad (55)$$



Proposition 5.1 (Spectrum of $G_N$): For a sample $\{x_i\}_{i=1}^{N}$, let $G_N$ be defined
as in (55), and let $K$ be the Gram matrix of products $K_{ij} = \langle \phi(x_i), \phi(x_j) \rangle$.
Then, $G_N$ has at most $N$ positive eigenvalues $\lambda_i$ satisfying:

$$\frac{1}{N} K \boldsymbol{\alpha}_i = \lambda_i \boldsymbol{\alpha}_i. \qquad (56)$$

Moreover, $N \lambda_i$ are all the positive eigenvalues of $K$.

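Proof 5.1 First notice that for all $f \perp \operatorname{span}\{\phi(x_i)\}$, we have $G_N f = 0$, and
thus any eigenvector with a corresponding positive eigenvalue must belong to
$\operatorname{span}\{\phi(x_i)\}$, which is a subspace of dimension at most $N$; therefore, since $G_N$ is
normal, there can be at most $N$ positive eigenvalues. Now let $v$ be an eigenvector
of $G_N$ with eigenvalue $\lambda$; we have that

$$G_N v = \lambda v.$$

Then, for each $\phi(x_i)$ it is true that

$$\langle \phi(x_i), G_N v \rangle = \frac{1}{N} \sum_{j=1}^{N} \langle \phi(x_i), \phi(x_j) \rangle \langle \phi(x_j), v \rangle = \lambda \langle \phi(x_i), v \rangle.$$

By taking $\alpha_i = \langle \phi(x_i), v \rangle$ we can form the following system of equations:

$$\frac{1}{N} K \boldsymbol{\alpha} = \lambda \boldsymbol{\alpha}, \qquad (57)$$

which is true for all positive eigenvalues of $G_N$.

□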
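Proposition 5.1 can be illustrated in finite dimensions, where the feature map is an explicit vector. In the sketch below the rows of a hypothetical matrix `Phi` stand in for $\phi(x_i)$, so that $G_N = \frac{1}{N}\Phi^\top\Phi$ and $K = \Phi\Phi^\top$; the dimensions and the random features are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 8, 20                               # sample size and (assumed finite) feature dimension
Phi = rng.normal(size=(N, p))              # row i plays the role of phi(x_i)

G_N = Phi.T @ Phi / N                      # empirical operator, a p x p matrix
K = Phi @ Phi.T                            # Gram matrix, N x N

eig_G = np.sort(np.linalg.eigvalsh(G_N))[::-1]
eig_K = np.sort(np.linalg.eigvalsh(K / N))[::-1]
num_positive = int(np.sum(eig_G > 1e-10))

# Eigenvector relation used in the proof: alpha_i = <phi(x_i), v> solves (1/N) K alpha = lambda alpha.
lam, vecs = np.linalg.eigh(G_N)
v, top = vecs[:, -1], lam[-1]              # eigenpair with the largest eigenvalue
alpha = Phi @ v

print(num_positive, np.allclose(eig_G[:N], eig_K), np.allclose(K @ alpha / N, top * alpha))
```

The positive eigenvalues of `G_N` coincide with those of `K / N`, and the remaining `p - N` eigenvalues vanish up to round-off.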

Proposition 5.2 (Compactness of $G$): $G : \mathcal{H} \mapsto \mathcal{H}$ defined by (49) is compact.

Proof 5.2 We will show that $g_n \rightharpoonup g$ in $\mathcal{H}$ ($\{g_n\}$ is weakly convergent)
implies that $G g_n \to G g$ strongly in $\mathcal{H}$. Since $\mathcal{H}$ is a Hilbert space and $G g_n$
converges weakly to $G g$, we only need to show that

$$\|G g_n\| \to \|G g\|. \qquad (58)$$

Since any $f \in \mathcal{H}$ is also in $L_2(P_X)$,

$$\begin{aligned}
\langle G g_n, G g_n \rangle &= \int_{\mathcal{X}} g_n(x) (G g_n)(x) \, dP_X(x), \\
\langle G g, G g \rangle &= \int_{\mathcal{X}} g(x) (G g)(x) \, dP_X(x).
\end{aligned} \qquad (59)$$

Moreover,

$$\begin{aligned}
|g_n(x)| &\leq \|g_n\| \kappa(x, x)^{1/2}, \\
|(G g_n)(x)| &\leq \|G g_n\| \kappa(x, x)^{1/2},
\end{aligned} \qquad (60)$$

and therefore, $|g_n (G g_n)(x)| \leq \|g_n\| \|G g_n\| \kappa(x, x)$. Since both $\{g_n\}$ and $\{G g_n\}$
are weakly convergent in $\mathcal{H}$ their norms are bounded; then $\{g_n (G g_n)\}$ is dominated
by a constant multiple of $\kappa(x, x)$, which has finite $L_1(P_X)$ norm. The weak convergence
property of $\{g_n\}$ implies that $g_n(x) \to g(x)$ point-wise, which also implies
$g_n (G g_n)(x) \to g (G g)(x)$ point-wise. Since these functions are uniformly bounded
by the integrable function $\kappa(x, x)$, by Lebesgue dominated convergence in $L_1(P_X)$
we have:

$$\int_{\mathcal{X}} g_n (G g_n)(x) \, dP_X(x) \to \int_{\mathcal{X}} g (G g)(x) \, dP_X(x), \qquad (61)$$

which proves that $\|G g_n\| \to \|G g\|$, and thus $G$ is compact.

□

The following theorem found in [22] is a variational characterization of the
discrete spectrum (eigenvalues) of a compact operator in a separable Hilbert
space.

Theorem 5.1 Let $A$, $B$ be self-adjoint operators in a separable Hilbert space
$\mathcal{H}$, such that $B = A + C$, where $C$ is a compact self-adjoint operator. Let $\{\gamma_k\}$
be an enumeration of nonzero eigenvalues of $C$. Then there exist extended
enumerations $\{\alpha_j\}$, $\{\beta_j\}$ of discrete eigenvalues for $A$, $B$, respectively, such
that:

$$\sum_j \varphi(\beta_j - \alpha_j) \leq \sum_k \varphi(\gamma_k), \qquad (62)$$

where $\varphi$ is any nonnegative convex function on $\mathbb{R}$, and $\varphi(0) = 0$.
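In finite dimensions, inequality (62) has a well-known analogue for symmetric matrices when the eigenvalues of $A$ and $B$ are paired in sorted order (for $\varphi(x) = x^2$ this is the Hoffman-Wielandt inequality). The sketch below checks this matrix analogue, not the operator statement itself, on random symmetric matrices; the matrix size, the perturbation scale and the helper name are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_symmetric(n, scale=1.0):
    """Random symmetric matrix (hypothetical helper used only for this check)."""
    M = rng.normal(scale=scale, size=(n, n))
    return (M + M.T) / 2.0

n = 12
A = random_symmetric(n)
C = random_symmetric(n, scale=0.1)         # symmetric perturbation
B = A + C

alpha = np.linalg.eigvalsh(A)              # ascending order
beta = np.linalg.eigvalsh(B)               # ascending order, paired with alpha
gamma = np.linalg.eigvalsh(C)

# phi(x) = x^2 and phi(x) = |x| are both nonnegative, convex, and vanish at 0.
lhs_sq, rhs_sq = np.sum((beta - alpha) ** 2), np.sum(gamma**2)
lhs_abs, rhs_abs = np.sum(np.abs(beta - alpha)), np.sum(np.abs(gamma))

print(lhs_sq <= rhs_sq, lhs_abs <= rhs_abs)
```

The choice $\varphi(x) = x^2$ is the one used later to bound the distance between the spectra of $G$ and $G_N$.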

The definition of extended enumeration $\{\alpha_j\}$ according to Theorem 5.1 means
that for a self-adjoint operator $A$ in $\mathcal{H}$ only the discrete eigenvalues with finite
multiplicity $m$ are listed $m$ times and any other values are listed as zero. If we
have a bounded kernel, which is always the case for a normalized version of the
infinitely divisible matrix, we can apply Hoeffding's inequality. Let $\xi_i$
be a sequence of zero mean, independent random variables taking values in a
separable Hilbert space such that $\|\xi_i\| \leq C$ for all $i$; then:

$$\Pr\left[\left\|\frac{1}{N} \sum_{i=1}^{N} \xi_i\right\| \geq \epsilon\right] \leq 2 \exp\left(-\frac{N \epsilon^2}{2 C^2}\right). \qquad (63)$$

Note that $(G_N - G)$ is a compact operator. Let $\{\psi_j\}$ be a complete orthonormal
basis for $\mathcal{H}$; we can write,

$$\sum_{j=1}^{N_{\mathcal{H}}} (G_N - G) \psi_j = \frac{1}{N} \sum_{i=1}^{N} \left(\phi(x_i) \|\phi(x_i)\| - E\left[\phi(X) \|\phi(X)\|\right]\right). \qquad (64)$$

Combining (63) with (64) and Theorem 5.1 yields the following result.
Table 1: List of distributions used in the independence test along with their
corresponding original and resulting kurtosis after centralization and rescaling

Distribution                                    Kurtosis

Student's t distribution 3 DOF                  ∞
Double exponential                              3.00
Uniform                                         -1.20

Student's t distribution 5 DOF                  6.00
Exponential                                     6.00
Mixture, 2 double exponentials                  -1.16

Symmetric mixture, 2 Gaussian, multimodal       -1.68
Symmetric mixture, 2 Gaussian, transitional     -0.74
Symmetric mixture, 2 Gaussian, unimodal         -0.50

Asymmetric mixture, 2 Gaussian, multimodal      -0.53
Asymmetric mixture, 2 Gaussian, transitional    -0.67
Asymmetric mixture, 2 Gaussian, unimodal        -0.47

Symmetric mixture, 4 Gaussian, multimodal       -0.82
Symmetric mixture, 4 Gaussian, transitional     -0.62
Symmetric mixture, 4 Gaussian, unimodal         -0.80

Asymmetric mixture, 4 Gaussian, multimodal      -0.77
Asymmetric mixture, 4 Gaussian, transitional    -0.29
Asymmetric mixture, 4 Gaussian, unimodal        -0.67
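The kurtosis values in Table 1 appear to be excess kurtosis (fourth standardized moment minus 3), since the uniform, double exponential and exponential entries match the known values -1.2, 3 and 6. The sketch below, which assumes NumPy's samplers and hypothetical helper names, standardizes samples to zero mean and unit variance and estimates the excess kurtosis for these three rows only; the mixture densities of the benchmark are not reproduced here.

```python
import numpy as np

def standardize(z):
    """Shift and scale a sample to zero mean and unit variance."""
    return (z - z.mean()) / z.std()

def excess_kurtosis(z):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    return float(np.mean(standardize(z) ** 4) - 3.0)

rng = np.random.default_rng(4)
n = 500_000                                # large sample, since kurtosis estimates converge slowly
samples = {
    "Uniform": rng.uniform(-1.0, 1.0, n),
    "Double exponential": rng.laplace(0.0, 1.0, n),
    "Exponential": rng.exponential(1.0, n),
}
kurt = {name: excess_kurtosis(z) for name, z in samples.items()}

for name, value in kurt.items():
    print(f"{name}: {value:.2f}")
```

Excess kurtosis is invariant to shifting and scaling, so the standardization step changes the mean and variance of each density but not its tabulated kurtosis.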



Theorem 5.2 Let $\kappa$ be a positive definite kernel satisfying (48) with $\kappa(x, x) \leq C$.
Let $\lambda_i$ and $\hat{\lambda}_i$ be the extended enumerations of the discrete eigenvalues of $G$ and
$G_N$, respectively. Then, with probability $1 - \delta$

Proof 5.3 Apply the result of Theorem 5.1 using $\varphi(x) = x^2$.

□
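The consistency of the empirical spectrum can be explored with a small simulation. The sketch below is only an illustration under assumed settings (a Gaussian kernel of unit width, a one-dimensional standard normal sample, and nested subsamples); it compares the leading eigenvalues of $\frac{1}{N}K$ for increasing $N$ against those of a larger reference sample, which stands in for the unknown spectrum of $G$.

```python
import numpy as np

def scaled_gram_eigs(x, sigma=1.0, top=5):
    """Largest `top` eigenvalues of K/N for a Gaussian kernel on a 1-D sample."""
    d2 = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-d2 / (2.0 * sigma**2))
    lam = np.linalg.eigvalsh(K / len(x))
    return lam[::-1][:top]

rng = np.random.default_rng(5)
x = rng.normal(size=1500)                  # hypothetical reference sample
reference = scaled_gram_eigs(x)

# Euclidean distance between leading eigenvalues of nested subsamples and the reference.
errors = {n: float(np.linalg.norm(scaled_gram_eigs(x[:n]) - reference))
          for n in (40, 200, 1000)}

for n, err in errors.items():
    print(n, err)
```

With these settings the distance to the reference spectrum shrinks as $N$ grows, in line with the qualitative content of Theorem 5.2. The simulation does not verify any specific constant in the bound.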

6 Experiments 

Here, we develop a test for independence between random elements $X$ and $Y$
based on the gap between the entropy of the tensor and Hadamard products
of their Gram matrices, applied to an experimental setup similar to [23]. We
draw $N$ i.i.d. samples from two randomly picked densities corresponding to
the ICA benchmark densities [10]. These densities are scaled and shifted such
that they have zero mean and unit variance (see Table 1). The pair of random
variables is mixed using a 2-dimensional rotation matrix with rotation angle
$\theta \in [0, \pi/4]$. Gaussian noise with unit variance and zero mean is added
as extra dimensions. Finally, each one of the random vectors is rotated by a
random rotation (orthonormal matrix) in $\mathbb{R}^2$ and $\mathbb{R}^3$, accordingly. This causes
the resulting random vectors to be dependent across all observed dimensions.
We perform experiments varying angles, sample sizes, and dimensionality. The
test compares the value of the gap:



$$S_\alpha(K_X) + S_\alpha(K_Y) - S_\alpha\left(\frac{K_X \circ K_Y}{\operatorname{tr}(K_X \circ K_Y)}\right), \qquad (66)$$



where $K_X$ and $K_Y$ are the Gram matrices (Gaussian kernel) for the $X$ and $Y$
components of the sample $\{(x_i, y_i)\}_{i=1}^{N}$, with a threshold computed by sampling
a surrogate of the null hypothesis $H_0$ based on shuffling one of the components of
the sample $k$ times, that is, the correspondences between $x_i$ and $y_i$ are broken
by the random permutations. The threshold is the estimated quantile $1 - \tau$,
where $\tau$ is the significance level of the test (Type I error). The hypothesis $H_0$,
$X$ is independent of $Y$, is accepted if the gap (66)
is below the threshold; otherwise, we reject $H_0$. In all our experiments $k = 100$
and $\alpha = 1.01$. The solid lines in Figures 2(a), 2(b), 2(c), 2(d) and 2(e) show the
estimated probability of $H_0$ being accepted for the proposed test with $\tau = 0.05$.
The dotted lines in Figures 2(a), 2(b) and 2(c) are the acceptance rates obtained
using the kernel-based statistic proposed in [23],

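$$T_n = \frac{1}{n^2} \sum_{i,j=1}^{n} L_h(x_i - x_j) L'_h(y_i - y_j) + \left(\frac{1}{n^2} \sum_{i,j=1}^{n} L_h(x_i - x_j)\right) \left(\frac{1}{n^2} \sum_{i,j=1}^{n} L'_h(y_i - y_j)\right) - \frac{2}{n^3} \sum_{j=1}^{n} \left(\sum_{i=1}^{n} L_h(x_i - x_j)\right) \left(\sum_{i=1}^{n} L'_h(y_i - y_j)\right), \qquad (67)$$

where $L_h$ and $L'_h$ are characteristic kernels on the domains of $X$ and $Y$, respectively [24]. The dotted lines in
Figures 2(d) and 2(e) are the acceptance rates for a statistic based on the difference
between joint and marginal Rényi's entropies estimated using the minimum
spanning tree graph as proposed in [3], namely,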



$$H_\alpha(X_n) + H_\alpha(Y_n) - H_\alpha(X_n, Y_n), \qquad (68)$$



where H a (Z n ) = j^-logmin^ 



T is the set of vertices of the entropic 



graph and $|e|$ denotes the Euclidean norm of the edge $e$. Notice that here we
do not consider the bias-correction constant that is presented in [4], since it will be
present in the threshold as well. The results are averages over 100 simulations
for each one of the parameter configurations. In the case of $X, Y \in \mathbb{R}$ (Figure
2(a)), the type II error is low even for small sample sizes, whereas the dependence
becomes more difficult to detect as $d$ increases, requiring a larger $N$ to obtain



[Figure 2, panels (a) to (e): estimated acceptance rate of $H_0$ versus rotation angle (× π/4); recoverable panel titles: "Dim 2, σ² = 2" and "Dim 3, σ² = 3".]

Figure 2: Results of the independence test based on the gap between tensor
and Hadamard product entropies for different sample sizes and dimensionality.
Figures 2(a), 2(b) and 2(c) correspond to the estimated acceptance rates for $H_0$
for random variables of 1, 2, and 3 dimensions, and compare the results between
the proposed test (solid lines) and the kernel-based statistic (dotted) proposed in
[23]. Figures 2(d) and 2(e) correspond to the estimated acceptance rates for $H_0$
for random variables of 2 and 3 dimensions, and compare the results between the
proposed test (solid lines) and the minimum spanning graph entropy estimator
(dotted) proposed in [3]. The larger the angle, the easier it is to reject independence
$H_0$.
Figure 3: Results of the independence test based on the gap between tensor and
Hadamard product entropies for different kernel sizes $\sigma$ and entropy orders $\alpha$
for a fixed sample of size 1024 and a fixed rotation angle $\theta$. The dimensionality of
the random variables is $d = 2$.

an acceptable type II error. Our results are competitive with those obtained with
the kernel-based statistic (67) and the entropic graph estimator (68). The three
methods perform similarly for large angles, but it can be noticed that
the proposed method works better when the angle is close to 0. It is important
to point out that in all cases (the proposed statistic using the gap, and the ones in
(67) and (68)) the threshold was empirically determined by approximating the
null distribution using permutations on one of the variables. Whether we can
provide a distribution of the null hypothesis for (66) is a subject of future work.
Figure 3 shows the influence of the parameters on the power of the proposed
independence test. The behavior of the test for different orders $\alpha$ and kernel
sizes $\sigma$ can be explained from the spectral properties of the Gram matrices.
For smaller kernel sizes the Gram matrix approaches the identity and thus its
eigenvalues become more similar, with $1/n$ as the limit case. Therefore, the gap
(66) monotonically increases as $\sigma \to 0$, and so does the gap for the permuted sample.
Since both quantities have the same upper bound, the probability of accepting
$H_0$ increases. The other phenomenon is related to the entropy order: it can be
noticed that the larger the order $\alpha$, the smaller the kernel size $\sigma$ that is needed
to minimize the type II error. The order has a smoothing effect on the resulting
operator defined in (53). Large $\alpha$ will emphasize the largest eigenvalues of
the Gram matrices that are commonly associated with slowly changing features.
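The permutation procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the reported experiments: the entropy functional is assumed to be $S_\alpha(A) = \frac{1}{1-\alpha}\log_2 \operatorname{tr}(A^\alpha)$ for a trace-one matrix $A$, the kernel is Gaussian with unit width, the function names are hypothetical, and the toy data replace the rotated ICA benchmark densities.

```python
import numpy as np

def gaussian_gram(z, sigma=1.0):
    """Gaussian Gram matrix of the rows of z."""
    z = np.asarray(z, dtype=float).reshape(len(z), -1)
    sq = np.sum(z**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * z @ z.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def matrix_entropy(K, alpha=1.01):
    """Assumed functional: log2(tr(A^alpha)) / (1 - alpha) with A = K / tr(K)."""
    lam = np.clip(np.linalg.eigvalsh(K / np.trace(K)), 0.0, None)
    return float(np.log2(np.sum(lam**alpha)) / (1.0 - alpha))

def gap(Kx, Ky, alpha=1.01):
    """Gap (66): S(Kx) + S(Ky) - S(Kx o Ky / tr(Kx o Ky)), with o the Hadamard product."""
    return (matrix_entropy(Kx, alpha) + matrix_entropy(Ky, alpha)
            - matrix_entropy(Kx * Ky, alpha))

def independence_test(x, y, alpha=1.01, sigma=1.0, n_perm=100, tau=0.05, seed=0):
    """Return (observed gap, threshold, accept_H0) using a permutation null."""
    rng = np.random.default_rng(seed)
    Kx, Ky = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    observed = gap(Kx, Ky, alpha)
    null = np.empty(n_perm)
    for b in range(n_perm):
        p = rng.permutation(len(Ky))       # break the pairing between x_i and y_i
        null[b] = gap(Kx, Ky[np.ix_(p, p)], alpha)
    threshold = float(np.quantile(null, 1.0 - tau))
    return observed, threshold, bool(observed < threshold)

rng = np.random.default_rng(6)
x = rng.normal(size=(100, 1))
y_dep = x + 0.1 * rng.normal(size=(100, 1))    # strongly dependent toy pair
obs, thr, accept = independence_test(x, y_dep)
print(obs, thr, accept)
```

Permuting the rows and columns of $K_Y$ is equivalent to recomputing the Gram matrix on the shuffled sample, which avoids repeated kernel evaluations. For the strongly dependent toy pair the observed gap exceeds the permutation threshold, so $H_0$ is rejected.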

7 Conclusions 

We presented a framework for estimating entropy-like quantities based on
infinitely divisible matrices. By using the axiomatic characterization of Rényi's
entropy, a functional on positive definite matrices is defined. This functional
resembles the definitions of entropy in quantum information theory; however, our
analysis is different from QIT since we need to consider not only the functional
but also the kernel employed to compute the proposed measures of entropy.
The use of Hadamard products allows us to define quantities that are similar
to mutual information and conditional entropy, and to set the conditions that lead
to infinite divisibility. We showed some properties of the proposed quantities
and their asymptotic behavior as operators in reproducing kernel Hilbert spaces
defined by distribution-dependent kernels. Numerical experiments showed the
usefulness of the proposed approach, with results that are competitive with the
state of the art.